ABSTRACT

We address a problem in signal classification applications,
such as automatic speech recognition (ASR) systems that
employ the hidden Markov model (HMM): it is necessary
to settle for a fixed analysis window size and a fixed
feature set. This is despite the fact that complex signals
such as human speech typically contain a wide range of signal
types and durations. We apply the probability density
function (PDF) projection theorem to generalize the HMM
to utilize a different feature set and segment length
for each state. We demonstrate the algorithm
using speech analysis so that long-duration phonemes such
as vowels and short-duration phonemes such as plosives can
utilize feature extraction tailored to their own time scale.


1.  INTRODUCTION 

The hidden Markov model (HMM) [1] combined with
spectral analysis using cepstral coefficients [2] on
fixed-length analysis windows remains at the forefront of
automatic speech recognition (ASR) technology. One problem
with this architecture is the necessity of using a fixed
analysis window size. This constraint is a problem because in
speech and other natural processes, the various phenomena
that are being tested (such as phonemes in speech) may occur
with differing time scales. The window size used in speech
analysis is a compromise between phonemes with long time
scales such as vowels and phonemes with shorter time scales
such as plosives. The need for a fixed-size window arises
from the fundamental probabilistic approach that underlies
the method, which depends on the comparison of likelihood
functions formed on a common feature space. One cannot
directly compare two likelihood functions if they are defined
on different feature spaces. Even if pains are taken to
normalize the behavior of similar features obtained from
differing-size data windows, the fundamental basis for
comparison is suspect.

With the introduction of the class-specific feature
theorem [3], [4], [5], and later the probability density function
(PDF) projection theorem (PPT) [6], the freedom now exists
to use a different feature set for each class, even for each state
in an HMM [7], and, as we now show, a different analysis
window length for each state. Thus, the topic of this paper is to
apply the PPT to the problem of using varying-size analysis
windows within the framework of an HMM.


2.  THE  HMM  AND  MULTI-RESOLUTION  HMM 
(MRHMM)  ON  RAW  DATA 

We assume familiarity with hidden Markov models (HMMs).
A good reference is the article by Rabiner [1], from which
we borrow notation. If we ignore the effects of overlapped
processing, the underlying assumption when a time-series is
segmented for processing is that the data in two different
segments are conditionally statistically independent (CSI). In
other words, the data in two segments are statistically
independent conditioned on the system states in the two segments
being known. The CSI property enables the efficient
calculation of the joint PDF using the forward procedure. Let there
be a raw data time-series, denoted by $\mathbf{X}$, consisting of an
integer multiple of $T$ samples, where $T$ is the basic time
quantization. The traditional approach, which we describe simply
as the HMM, is to divide the data into uniform $T$-sample
segments which are to be processed separately. Let $\mathbf{x}_t$ represent
the data in time-step $t$, consisting of data samples $1 + (t-1)T$
through $tT$. In the HMM, it is assumed that:

1. during any $T$-sample segment, the data is governed by
one of $M$ possible states.

2. any two samples, no matter how close together, that are
contained in two different segments, are CSI.

For  the  MRHMM,  however,  we  assume  that: 

1. during any $T$-sample segment, the data is governed by
one of $M$ possible states.

2. for each state $s$, there is an associated minimum time
duration. Once the system transitions to state $s$, it must
remain in that state for $nK_sT$ samples, where $K_s$ is the
integer minimum duration parameter for state $s$, and
$n \ge 1$.

3. two data samples $x_i$ and $x_j$ are assumed to be CSI and are
processed separately if (a) the system has made at least
one state transition between times $i$ and $j$, or (b) the
system has been continually in the same state $s$ but samples
$i$ and $j$ are in different length-$K_sT$ segments. Otherwise,
samples $x_i$ and $x_j$ are processed jointly.

4. to allow for the system being in a state for a number of
length-$T$ segments not divisible by $K_s$, we define a
number of slave states, say states $s'$, $s''$, that are slaved to state
$s$ in a way to be described, with $K_s > K_{s'} > K_{s''}$.

Let $Q = [s_1, s_2 \ldots s_N]$ be a set of state values, where
$1 \le s_t \le M$, $1 \le t \le N$. We call $Q$ a trajectory because it defines
one of the many paths through the state diagram or trellis.
Let $p(\mathbf{X}|Q)$ be the likelihood function of the raw data given
the trajectory $Q$. Using the CSI property, we may write

$$\text{HMM:} \quad p(\mathbf{X}|Q) = \prod_{t=1}^{N} p(\mathbf{x}_t|s_t). \tag{1}$$

There are no restrictions on $Q$ except for those restrictions
imposed by the initial state probabilities $\pi = \{\pi_m, 1 \le m \le M\}$
and the state transition matrix (STM)
$A = \{A_{l,m}, 1 \le l \le M, 1 \le m \le M\}$. For the MRHMM, we can encode all the
above restrictions imposed on state transitions by properly
structuring $\pi$ and $A$. For each state $s$, we can define a
partition of states, which we call wait states, of size $K_s$. Let $A^e$
be the expanded MRHMM STM and let $\pi^e$ be the expanded
set of prior state probabilities. We structure $A^e$ so that state
transitions into the state $s$ partition are only allowed into the
first wait state. From the first wait state, the state is forced
to increment to the second, third, ... and finally to wait state
$K_s$. From wait state $K_s$, the state is allowed to transition to
the first wait state of any state partition. Note that although
$A^e$ is of dimension $M^e \times M^e$, where $M^e = \sum_{m=1}^{M} K_m$, there are
only $M^2$ free parameters in $A^e$.
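The structure of $A^e$ described above can be sketched in a few lines. This is a minimal illustration, not the implementation used for the experiments; the function name `build_expanded_stm` and the reading of `A[l][m]` as the probability of moving from state `l` to state `m` once the minimum duration has been served are assumptions of the sketch.

```python
def build_expanded_stm(A, K):
    """Expand an M x M transition matrix A into the wait-state matrix A^e.

    A[l][m]: probability of moving from state l to state m after state l
             has served its minimum duration (assumption of this sketch).
    K[m]:    number of wait states in the partition of state m.
    """
    M = len(A)
    # Index of the first wait state of every partition.
    first = [sum(K[:m]) for m in range(M)]
    Me = sum(K)
    Ae = [[0.0] * Me for _ in range(Me)]
    for l in range(M):
        # Inside a partition the wait state is forced to increment.
        for k in range(K[l] - 1):
            Ae[first[l] + k][first[l] + k + 1] = 1.0
        # From the last wait state, jump to the first wait state of any partition.
        last = first[l] + K[l] - 1
        for m in range(M):
            Ae[last][first[m]] = A[l][m]
    return Ae, first


# Hypothetical two-state example: state 0 has K = 3, state 1 has K = 1.
A = [[0.9, 0.1], [0.2, 0.8]]
K = [3, 1]
Ae, first = build_expanded_stm(A, K)
```

In this example $A^e$ is $4 \times 4$, yet its only free entries are the four values of `A`; the remaining nonzero entries are the forced increments.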

At this point, the MRHMM can be seen as nothing more
than an HMM with a specially structured $\pi$ and $A$. But the
more important difference, which we will explain below, is
in the way that $p(\mathbf{X}|Q)$ is calculated. For the moment, let
us talk about our goal. We seek an algorithm to solve the
following four problems:

1. Segmentation. Find the most likely trajectory through the
trellis subject to the restrictions described above.

2. State probabilities. Determine the a posteriori state
probabilities $\gamma_{t,m} = p(s_t = m|\mathbf{X})$. This is a more complete
description of the trajectories than knowing the single most
likely trajectory.

3. Joint PDF. The joint likelihood function of all the data
given the model is given by

$$L(\mathbf{X}) = \sum_{Q \in \mathcal{Q}} p(\mathbf{X}|Q)\, P(Q), \tag{2}$$

where $\mathcal{Q}$ is the set of all possible trajectories and $P(Q)$ is
the a priori probability of a given trajectory through the
trellis. Note that $L(\mathbf{X})$ averages $p(\mathbf{X}|Q)$ over all
trajectories through the trellis weighted by the probability of
the trajectory. Invalid trajectories have zero contribution.

4. Re-estimation. We would like to estimate the model
parameters from the data. Parameters include $\pi$, $A$, and the
parameters $\theta_s$ of the conditional state PDFs $p(\mathbf{x}_t|s; \theta_s)$.
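As an illustration of problem 3, the sketch below evaluates (2) in two ways: by the forward procedure of [1] and by explicit enumeration of every trajectory. The numerical values are made up for the example, and no scaling is applied, so the code is suitable only for short sequences.

```python
from itertools import product


def forward_likelihood(P, pi, A):
    """L(X) = sum over trajectories of p(X|Q) P(Q), via the forward procedure.

    P[t][q] = p(x_t | q), pi[q] = initial probability, A[i][j] = P(j | i).
    """
    n_states = len(pi)
    alpha = [pi[q] * P[0][q] for q in range(n_states)]
    for t in range(1, len(P)):
        alpha = [
            P[t][j] * sum(alpha[i] * A[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return sum(alpha)


def brute_force_likelihood(P, pi, A):
    """Same quantity by explicit enumeration of every trajectory Q."""
    n_states, n_steps = len(pi), len(P)
    total = 0.0
    for Q in product(range(n_states), repeat=n_steps):
        prior = pi[Q[0]]          # P(Q)
        lik = P[0][Q[0]]          # p(X|Q) under the CSI assumption
        for t in range(1, n_steps):
            prior *= A[Q[t - 1]][Q[t]]
            lik *= P[t][Q[t]]
        total += lik * prior
    return total


# Made-up values: 4 time steps, 2 states.
P = [[0.5, 0.1], [0.4, 0.3], [0.2, 0.6], [0.1, 0.7]]
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
L_fwd = forward_likelihood(P, pi, A)
L_bf = brute_force_likelihood(P, pi, A)
```

The two values agree to rounding error. Trajectories that a structured transition matrix forbids contribute zero through `prior`, which is how invalid trajectories drop out of the sum.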

For the HMM, the above problems are solved by the forward
procedure and the associated backward procedure and the
Baum-Welch algorithm [1]. For the MRHMM, we need to
adapt these algorithms, not only by structuring $\pi$ and $A$,
but by changing the way that $p(\mathbf{X}|Q)$ is calculated. We will
explain by example. Let the first state partition be of length 3
($K_1 = 3$) and let the partition for state $s = 1$ consist of the
wait states $q = 1$, $q = 2$, and $q = 3$. Let

Q  =  [4,6,7,1,2,3,4,5,6,7,10,11,12,5,6,7,1,2,3,10] 

be a particular valid length-20 state trajectory. Being a valid
trajectory, wait states $q = 1$ through $q = 3$ occur only as part
of the sequence 1,2,3. Here is the point at which the HMM
and MRHMM differ. For the HMM, we have

$$p(\mathbf{X}|Q) = p(\mathbf{x}_1|q_1 = 4) \cdots p(\mathbf{x}_{20}|q_{20} = 10). \tag{3}$$


We can gather all the state PDF values into the matrix
$P_{t,q} = p(\mathbf{x}_t|q)$. Then, (2) may be computed by the well-known
forward procedure [1] operating on $P_{t,q}$ and using parameters $\pi$
and $A$. To change the HMM into an MRHMM, we need two
steps:

Step 1. Partial PDF values. For each valid trajectory
$Q$ and each state $s$, collect all terms in $p(\mathbf{X}|Q)$ associated
with the wait state sequence for state partition $s$ and
replace the terms by the partial PDF value. Define

$$p(\mathbf{x}_t^{K_s}|s) = \prod_{i=1}^{K_s} p(\mathbf{x}_{t+i-1}|q_i),$$

where $q_1 \ldots q_{K_s}$ is the sequence of wait
states in the state $s$ partition. Define $[p(\mathbf{x}_t^{K_s}|s)]^{1/K_s}$ as the
partial PDF value (the geometric mean of the PDF terms in
the sequence). In the above example, the first occurrence of
the wait state sequence 1,2,3 is the sequence of terms

$$p(\mathbf{x}_4|q_4 = 1)\, p(\mathbf{x}_5|q_5 = 2)\, p(\mathbf{x}_6|q_6 = 3),$$

which we denote by $p(\mathbf{x}_4^{K_1}|s = 1)$. We replace each of the
three PDF factors by the partial PDF value $[p(\mathbf{x}_4^{K_1}|s = 1)]^{1/3}$.
This substitution does not change the value of $p(\mathbf{X}|Q)$.

Note that we can accomplish this by changing $P_{t,q}$.
Associated with every possible occurrence of the partition $s$ sequence
is a diagonal line in matrix $P_{t,q}$ of length $K_s$. The diagonal
starts with the first wait state of partition $s$ at any time $t$ and
ends with the last wait state of partition $s$ at time $t + K_s - 1$.
Each such sequence is replaced with the geometric mean as
described. The resulting matrix is called the partial PDF
matrix $P^p_{t,q}$. Note that applying the forward procedure to $P^p_{t,q}$
gives precisely the same result as $P_{t,q}$ provided $\pi$ and $A$
reflect the restrictions to state transitions that were described
above. Matrix $P^p_{t,q}$, if viewed as an image, appears to have
diagonal “streaks” of constant value.
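Step 1 can be sketched as follows. The function name `partial_pdf_matrix` and the treatment of the data edges are assumptions of this illustration: streaks that would run past the end of the data are simply left unchanged here.

```python
import math


def partial_pdf_matrix(P, first, K):
    """Replace each complete diagonal streak of P by its geometric mean.

    P[t][q] = p(x_t | q); partition s owns wait states
    first[s] .. first[s] + K[s] - 1. Streaks that would run past the end
    of the data are left unchanged (an assumption of this sketch).
    """
    n_steps = len(P)
    Pp = [row[:] for row in P]
    for q0, k in zip(first, K):
        for t in range(n_steps - k + 1):
            # Product of the K_s terms on the diagonal starting at (t, q0).
            window = math.prod(P[t + i][q0 + i] for i in range(k))
            gmean = window ** (1.0 / k)
            for i in range(k):
                Pp[t + i][q0 + i] = gmean
    return Pp


# Made-up values: 4 time steps, partitions of length 3 and 1 (4 wait states).
P = [
    [0.2, 0.5, 0.1, 0.3],
    [0.4, 0.6, 0.7, 0.2],
    [0.9, 0.8, 0.5, 0.1],
    [0.3, 0.2, 0.4, 0.6],
]
Pp = partial_pdf_matrix(P, first=[0, 3], K=[3, 1])
```

The product along each complete streak is unchanged by the substitution, which is why the forward procedure returns the same value for every valid trajectory.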

Step 2. Relax CSI assumption. Associated with each
streak is a data analysis window $\mathbf{x}_t^{K_s}$, which is a segment of
data of length $K_sT$ samples ending at sample $(t + K_s - 1)T$.
The product of each streak of partial PDF values is a PDF of
the data analysis window assuming CSI segments. To relax
the CSI assumption within the streak, we replace the partial
PDF values with the $K_s$-th root of the analysis window PDF
processed as a unit. Let $P^f_{t,q}$ represent the “full window” partial
PDF values created this way. The value of $L(\mathbf{X})$ calculated
by the forward procedure operating on $P^f_{t,q}$ changes; however,
it remains a valid joint PDF of $\mathbf{X}$. We know this because all
we have done is replace the conditional PDFs $p(\mathbf{X}|Q)$, which
assume that all the segments are independent, with other PDFs
that assume statistical dependence within the wait state
sequences associated with a given state.

At this point we have a raw-data based MRHMM model
that we can compute efficiently using the forward procedure
operating on $P^f_{t,q}$. To create a feature-based MRHMM model,
we need only to apply the PPT.


3. CLASS-SPECIFIC MULTI-RESOLUTION HMM
(CS-MRHMM)

The standard feature-based HMM is the same as the raw-data
based HMM with the raw data segments $\mathbf{x}_t$ replaced by the
feature vectors

$$\mathbf{Z} = \{\mathbf{z}_1, \mathbf{z}_2 \ldots \mathbf{z}_N\}, \quad \mathbf{z}_t = T(\mathbf{x}_t).$$

With this simple replacement, the forward procedure
computes the feature-based likelihood function

$$L(\mathbf{Z}) = \sum_{Q \in \mathcal{Q}} p(\mathbf{Z}|Q)\, P(Q). \tag{4}$$

For the CS-MRHMM, we need to use the PPT to
transition to the feature domain. Let

$$\mathbf{x}_t^{K} = [x_{1+(t-1)T} \ldots x_{(t+K-1)T}]$$

be the length-$KT$ sample analysis window which starts at
sample $1 + (t-1)T$. It includes segments $\mathbf{x}_t$ through $\mathbf{x}_{t+K-1}$.
The term $p(\mathbf{x}_t^{K}|s)$ will be calculated using the PDF
projection theorem [6]. As we have written several publications
on the topic, including the tutorial article [8], we describe the
method only briefly. Let $\mathbf{x}$ be a general segment of raw
time-series data. Let $\mathbf{z}_s = T_s(\mathbf{x})$ be a feature set calculated from $\mathbf{x}$
specifically designed for state $s$. Let $p(\mathbf{z}_s|s)$ be a PDF
estimate of the feature set $\mathbf{z}_s$ based on training data from state $s$.
The feature likelihood function is projected from the feature
space to the raw data by pre-multiplying by the J-function as
follows:

$$p_p(\mathbf{x}|s) = J(\mathbf{x}; T_s, H_{0,s})\, p(\mathbf{z}_s|s). \tag{5}$$
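To make the projection in (5) concrete, the following sketch works through a toy case in which every quantity is available in closed form. The J-function is taken as the ratio of the raw-data PDF to the feature PDF under the reference hypothesis, as in [6]. Everything else is an assumption chosen for illustration and is not taken from the experiments in this paper: the reference hypothesis is iid $N(0,1)$ noise, the feature is the total energy $z = \sum_i x_i^2$ (chi-square distributed under the reference hypothesis), and the feature PDF for state $s$ is a scaled chi-square with variance `sigma2`.

```python
import math


def log_chi2_pdf(z, dof):
    """Log PDF of a chi-square variable with `dof` degrees of freedom."""
    half = dof / 2.0
    return ((half - 1.0) * math.log(z) - z / 2.0
            - half * math.log(2.0) - math.lgamma(half))


def log_j_function(x):
    """log J(x) = log p(x|H0) - log p(z|H0), z = sum(x_i^2), H0 = iid N(0, 1)."""
    n = len(x)
    z = sum(v * v for v in x)
    log_px_h0 = -0.5 * n * math.log(2.0 * math.pi) - 0.5 * z
    return log_px_h0 - log_chi2_pdf(z, n), z


def log_projected_pdf(x, sigma2):
    """log p_p(x|s) = log J(x) + log p(z|s), with p(z|s) a scaled chi-square."""
    log_j, z = log_j_function(x)
    # If x is iid N(0, sigma2) under s, then z / sigma2 is chi-square.
    log_pz_s = log_chi2_pdf(z / sigma2, len(x)) - math.log(sigma2)
    return log_j + log_pz_s


x = [0.3, -1.2, 0.8, 2.1, -0.5, 1.7]   # made-up data segment
sigma2 = 2.5                            # made-up state variance
z = sum(v * v for v in x)
log_pp = log_projected_pdf(x, sigma2)
log_true = -0.5 * len(x) * math.log(2.0 * math.pi * sigma2) - 0.5 * z / sigma2
```

In this toy case the energy is a sufficient statistic for distinguishing the two variances, so the projected PDF equals the true Gaussian PDF of the raw data and `log_pp` matches `log_true`.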


The function $p_p(\mathbf{x}|s)$ can be regarded as a function only of $\mathbf{x}$
by substituting $T_s(\mathbf{x})$ for $\mathbf{z}_s$ and can be shown to integrate to 1
over $\mathbf{x}$ (thus it is a PDF). The J-function is a unique function
of $\mathbf{x}$ determined precisely from the feature transformation $T_s$
and the class-dependent reference hypothesis $H_{0,s}$:

$$J(\mathbf{x}; T_s, H_{0,s}) = \frac{p(\mathbf{x}|H_{0,s})}{p(\mathbf{z}_s|H_{0,s})}. \tag{6}$$

Since $J(\mathbf{x}; T_s, H_{0,s})$ is determined a priori without regard
to training data, it can be considered the untrained part of
$p_p(\mathbf{x}|s)$, while $p(\mathbf{z}_s|s)$ is the trained part.

While it is true that $p_p(\mathbf{x}|s)$ is a PDF, it is only an
estimate of $p(\mathbf{x}|s)$. The degree to which $p_p(\mathbf{x}|s)$ is a good
estimate of $p(\mathbf{x}|s)$ depends on (a) the accuracy of $p(\mathbf{z}_s|s)$ and (b)
the degree to which $\mathbf{z}_s$ is a sufficient statistic for the binary
hypothesis test between $s$ and $H_{0,s}$. In the rare case that $\mathbf{z}_s$ is
in fact a sufficient statistic, the accuracy of $p_p(\mathbf{x}|s)$ depends
only upon the accuracy of the low-dimensional PDF estimate
$p(\mathbf{z}_s|s)$. The J-function takes many forms [6], one of which
can be used when $\mathbf{z}_s$ are maximum likelihood (ML) estimates
of a set of parameters. In this case, $J(\mathbf{x}; T_s, H_{0,s})$ has a simple
form based on the Fisher information matrix [6].


4. PRACTICAL IMPLEMENTATION DETAILS

Let us briefly review what we have done so far. We have
described how to compute the likelihood function of the
CS-MRHMM. To do this, we identify every time-shifted
analysis window. For each state $s$ and time step $t$, we identify the
analysis window that starts at time step $t$ and is of length $K_sT$
samples. On this window, we extract the state-dependent
feature set, then use the PPT to compute the raw-data PDF of
the analysis window. We then take the $K_s$-th root of the PDF
value and insert this value into the length-$K_s$ diagonal streak
in the matrix $P^f_{t,q}$. Then, we apply the well-known forward
procedure using the expanded parameters $\pi^e$, $A^e$. When $K_s$
is large, this requires a highly overlapped set of windows.
The amount of processing required can be mitigated by
recursive processing. For example, the FFT or autocorrelation
function (ACF) of a segment can be updated to reflect data
that has been shifted out and data that has been shifted in
[9]. Applying the MRHMM to real data warrants additional
details beyond what has been so far described.

4.1 States vs. Signal classes

Let signal class refer to a particular signal phenomenon
observed in the data. Let signal state refer to an instance of a
signal class. In the simplest situation, signal class and signal
state are synonymous. But, if a signal class is observed to
repeat, additional signal states may be used to represent the
additional occurrences. These in turn give rise to additional
partitions.

4.2 Slave Partitions

We have already introduced the notion of slave partitions
(slave states). Up to now, this has only meant the necessity
of adding additional states with lower $K$. To compute $L(\mathbf{X})$
with the forward procedure, there is nothing else that needs
to be done. However, to train the parameters, we will need to
discuss the process of ganging states.


4.3  Training  the  CS-MRHMM. 

In the standard Baum-Welch algorithm for re-estimation of
HMM parameters [1], the state feature PDFs for state $s$ are
trained by maximizing log-likelihood functions weighted by
$\gamma_{s,t}$. Since the standard HMM does not differentiate between
wait states, we would need a separate PDF estimate for
each wait state. However, for the CS-MRHMM, there are
only PDF estimates associated with the initial wait states,
the first wait states of each partition. Logically, the
CS-MRHMM produces values of $\gamma_{q,t}$ that are constant in
diagonal streaks in a partition. That is, $\gamma_{q,t} = \gamma_{q+1,t+1}$ if wait
states $q$ and $q + 1$ are in the same partition. Thus, in the
CS-MRHMM, each analysis window can be traced to a given
constant-valued streak in the $\gamma_{q,t}$ matrix. When training the
CS-MRHMM, the features from the associated analysis
window are weighted by the corresponding value of $\gamma_{q,t}$ in the
streak. Training becomes slightly more complicated,
however, once we consider slave partitions and if the number of
signal states exceeds the number of signal classes. While
each partition is associated with a PDF estimate, we may
not want all partition PDF estimates to be independent. To
remedy this situation, we “gang together” all partitions that
are associated with a given signal class. To gang partitions, we
first create a compressed version of $\gamma_{q,t}$, denoted by $\bar{\gamma}_{m,t}$,
which sums $\gamma_{q,t}$ over all wait states associated with signal
class $m$. We then weight an analysis window by the
smallest value of $\bar{\gamma}_{m,t}$ in the set of time steps $t$ contained in
the analysis window. This works very well in practice but is
a clear departure from the Baum-Welch algorithm and may
produce an algorithm without guaranteed monotonicity.
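The ganging operation can be sketched as below. The function names, the array layout `gamma[t][q]`, and the mapping `class_of_wait_state` are hypothetical choices made for the illustration.

```python
def compress_gamma(gamma, class_of_wait_state, n_classes):
    """Sum gamma[t][q] over all wait states q belonging to each signal class."""
    compressed = []
    for row in gamma:
        acc = [0.0] * n_classes
        for q, g in enumerate(row):
            acc[class_of_wait_state[q]] += g
        compressed.append(acc)
    return compressed


def window_weight(compressed, m, t_start, k):
    """Training weight of the class-m analysis window covering time steps
    t_start .. t_start + k - 1: the smallest compressed gamma in the window."""
    return min(compressed[t][m] for t in range(t_start, t_start + k))


# Made-up example: 3 time steps, 4 wait states; the first three wait
# states belong to class 0 and the last one to class 1.
gamma = [
    [0.5, 0.2, 0.1, 0.2],
    [0.1, 0.6, 0.1, 0.2],
    [0.1, 0.1, 0.7, 0.1],
]
compressed = compress_gamma(gamma, class_of_wait_state=[0, 0, 0, 1], n_classes=2)
w0 = window_weight(compressed, m=0, t_start=0, k=3)
w1 = window_weight(compressed, m=1, t_start=0, k=3)
```

Taking the minimum means a window receives a large weight only if its class is probable at every time step the window spans.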

4.4  Efficient  Implementation 

The number of wait states in the expanded HMM problem
can be very large. The forward and backward procedures
have a complexity of the order of the square of the number
of states. Thus, an efficient implementation of the forward
and backward procedures and the Baum-Welch algorithm
that takes advantage of the redundancies in the
expanded problem may be needed. We have obtained a time
reduction factor of 42 with a problem that had 7 signal classes
and expanded to 274 wait states. The efficient and the direct
implementations were tested to produce the same results
within machine precision.
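The efficient implementation itself is not described here. As one possible way to exploit the structure of $A^e$ from Section 2 (an assumption of this sketch, not necessarily the method used to obtain the reported speed-up), a forward step can be written using only the $M \times M$ matrix $A$ and the partition lengths, at a cost of order $M^2 + M^e$ per time step instead of $(M^e)^2$.

```python
def expand(A, K):
    """Dense M^e x M^e matrix with the wait-state structure of Section 2."""
    M = len(A)
    first = [sum(K[:m]) for m in range(M)]
    Ae = [[0.0] * sum(K) for _ in range(sum(K))]
    for l in range(M):
        for k in range(K[l] - 1):
            Ae[first[l] + k][first[l] + k + 1] = 1.0   # forced increment
        for m in range(M):
            Ae[first[l] + K[l] - 1][first[m]] = A[l][m]  # exit from last wait state
    return Ae


def dense_step(alpha, p_t, Ae):
    """One forward-procedure step with the full M^e x M^e matrix."""
    n = len(alpha)
    return [p_t[j] * sum(alpha[i] * Ae[i][j] for i in range(n)) for j in range(n)]


def structured_step(alpha, p_t, A, K):
    """Same step using only the M x M matrix A and the partition lengths K."""
    M = len(A)
    first = [sum(K[:m]) for m in range(M)]
    last = [first[m] + K[m] - 1 for m in range(M)]
    new = [0.0] * len(alpha)
    for m in range(M):
        # Entering a partition: only last wait states feed first wait states.
        new[first[m]] = p_t[first[m]] * sum(alpha[last[l]] * A[l][m] for l in range(M))
        # Inside a partition: each wait state is fed only by its predecessor.
        for q in range(first[m] + 1, last[m] + 1):
            new[q] = p_t[q] * alpha[q - 1]
    return new


# Made-up example: two states with K = 3 and K = 2 (five wait states).
A = [[0.9, 0.1], [0.2, 0.8]]
K = [3, 2]
alpha = [0.10, 0.20, 0.30, 0.15, 0.25]
p_t = [0.5, 0.4, 0.3, 0.2, 0.6]
dense = dense_step(alpha, p_t, expand(A, K))
fast = structured_step(alpha, p_t, A, K)
```

Both steps return the same vector; the structured version never forms $A^e$.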

5.  EXAMPLES 
5.1  Simulated  Data 

To illustrate the concepts, we tested the CS-MRHMM
using simulated data. To independent identically
distributed (iid) Gaussian noise, we added a low-frequency
(LF) pulse of an autoregressive (AR) process of 128 samples in
length with a peak frequency response of 0.4 radians per
sample, followed by a random-length gap of at least 256 samples,
followed by a high-frequency (HF) pulse of an AR process of 64
samples with a peak frequency response of 1.2 radians per
sample. An example of the signal and noise is shown in
Figure 1. We implemented the HMM with three signal states,
each corresponding to a signal class: “noise”, “LF pulse”,
and “HF pulse”. We used nine partitions including six slave
partitions. The elemental segment length was $T = 32$
samples. There were a total of 25 wait states. Parameters of the
nine partitions are listed in Table 1.

Figure 1: Example of spectrogram of synthetic data. The
data consists of three signal classes. Class 1 (noise) occurs
first, then a low-frequency pulse of duration 128 samples,
then noise, then a high-frequency pulse of duration 64
samples.

Partition   Signal class   KT    K   P
1           Noise          256   8   4
2           Noise          128   4   4
3           Noise          64    2   4
4           Noise          32    1   3
5           LF Pulse       128   4   4
6           LF Pulse       64    2   4
7           LF Pulse       32    1   3
8           HF Pulse       64    2   4
9           HF Pulse       32    1   3

Table 1: Partition parameters for the illustrative example.
K is the partition length in elemental segments. KT is the
length of the partition in samples. Parameter P is the
autoregressive (AR) model order (same as LPC model order).

Autoregressive (LPC) features of model order $P$ (see Table 1)
were extracted by overlapped window processing. A separate
feature processor was used for each combination of $K$ and $P$.
Features were shared between partitions that had the same $K$
and $P$ values. Analysis windows were always shifted by the
elemental segment length of 32 samples for each update, so the
amount of overlap depended on the length of the analysis window.
To handle end effects, data was assumed to wrap around in time.
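The bookkeeping implied by Table 1 can be checked with a short sketch. The partition list is copied from the table; the helper name `wait_state_ranges` is hypothetical.

```python
T = 32  # elemental segment length in samples

# (signal class, K) for the nine partitions of Table 1, in table order.
partitions = [
    ("Noise", 8), ("Noise", 4), ("Noise", 2), ("Noise", 1),
    ("LF Pulse", 4), ("LF Pulse", 2), ("LF Pulse", 1),
    ("HF Pulse", 2), ("HF Pulse", 1),
]


def wait_state_ranges(partitions):
    """Return {class: (first_q, last_q)} using 1-based wait-state indices."""
    ranges = {}
    q = 1
    for name, k in partitions:
        lo, _ = ranges.get(name, (q, q))   # keep the first index of the class
        ranges[name] = (lo, q + k - 1)     # extend to the end of this partition
        q += k
    return ranges


ranges = wait_state_ranges(partitions)
total_wait_states = sum(k for _, k in partitions)
window_lengths = [k * T for _, k in partitions]
```

The totals reproduce the 25 wait states quoted above and the KT column of Table 1, and the wait-state ranges per class are 1 to 15 for noise, 16 to 22 for the LF pulse, and 23 to 25 for the HF pulse.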

Features were extracted from each analysis window by
first taking the FFT, computing the magnitude squared, then
computing the inverse FFT to produce the autocorrelation
function (ACF). The Levinson algorithm was used to
produce the reflection coefficients of order $P$. The total power in
the window is also stored as the $(P+1)$st feature. The J-function
[6] is obtained by use of the saddle-point approximation [10].
Further details of the implementation of the AR
models can be found in [8].
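The feature chain described above can be sketched as follows. A naive $O(N^2)$ DFT stands in for the FFT so that the example needs only the standard library. The sign convention of the reflection coefficients and the use of the zero-lag ACF value as the total power are assumptions of this sketch; conventions differ between texts.

```python
import cmath


def circular_acf(x):
    """ACF as the inverse DFT of the squared-magnitude DFT (naive O(N^2) DFT)."""
    n = len(x)
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]
    spectrum = [sum(x[t] * w[(k * t) % n] for t in range(n)) for k in range(n)]
    power = [abs(s) ** 2 for s in spectrum]
    # The inverse DFT of a real, even sequence is real; keep the real part.
    return [
        sum(power[k] * w[(-k * lag) % n] for k in range(n)).real / n
        for lag in range(n)
    ]


def levinson(r, order):
    """Levinson-Durbin recursion: reflection coefficients and prediction error."""
    a = [1.0]        # prediction-error filter coefficients, a[0] = 1
    err = r[0]
    refl = []
    for m in range(1, order + 1):
        acc = sum(a[i] * r[m - i] for i in range(m))
        k = -acc / err
        refl.append(k)
        a_ext = a + [0.0]
        a = [a_ext[i] + k * a_ext[m - i] for i in range(m + 1)]
        err *= 1.0 - k * k
    return refl, err


def ar_features(x, order):
    """Reflection coefficients of the given order plus total power as last feature."""
    r = circular_acf(x)
    refl, _ = levinson(r, order)
    return refl + [r[0]]


r = circular_acf([1.0, 2.0, 3.0, 4.0])
refl, err = levinson([1.0, 0.5, 0.25], 2)
features = ar_features([1.0, 2.0, 3.0, 4.0], 2)
```

Because the ACF is obtained through the DFT, it is a circular autocorrelation, which is consistent with the assumption that data wraps around in time.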

In Figure 2, we see the partial PDF matrix $P^f_{t,q}$ for a
typical sample. Wait states $q = 1$ through $q = 15$ are associated
with the “Noise” signal class, wait states $q = 16$ through
$q = 22$ are associated with the “LF Pulse” signal class, and
wait states $q = 23$ through $q = 25$ are associated with the
“HF Pulse” signal class. The gamma probabilities are a
by-product of the Baum-Welch algorithm [1] and indicate the
relative probability of each wait state given the data. The
gamma probabilities corresponding to Figure 2 are shown in
Figure 3. This figure can be interpreted as showing the trajectories
through Figure 2 that pick up the highest probabilities while
meeting the restrictions set by the state transition matrix.

Figure 2: Partial PDF matrix $P^f_{t,q}$ showing divisions between
signal classes (solid horizontal lines) and between wait state
partitions (dotted lines). Higher probability is darker.


Figure  3:  Wait  state  probabilities  for  illustrative  example. 

Note that the wait states for “LF pulse” ($q = 16$ through
$q = 19$) are clearly seen where the pulse occurs. The same
is true of the “HF pulse” event ($q = 23$ through $q = 24$).
It is possible to see various competing trajectories through
the trellis. Note, for example, that in time steps 43 through 56,
the noise gap between the two pulses, the HMM is in the
noise signal class. In steps 43-50, it is in partition 1 (wait
states $q = 1$ through $q = 8$). Then, after exiting wait state
$q = 8$, it has located two possibilities to span the six time
steps remaining before the HF pulse occurs. It can either go into
partition 2 (wait states $q = 9$ through $q = 12$) then partition
3 (wait states $q = 13$ through $q = 14$), or it can choose the
reverse, partition 3 then partition 2.

Figure 4: Example of CS-MRHMM operating on the word “stool”. From top to bottom: compressed gamma probabilities
$\bar{\gamma}_{m,t}$, log signal power, and spectrogram. Short analysis windows have been employed for the “T”, while longer processing has
been used for background noise and the sounds “S”, “oo” and “L”. The three components of the “T” can be clearly identified.

The gamma probabilities can be collapsed to indicate just
the signal classes, as shown in Figure 5. The class
probabilities (Figure 5) are an accurate indication of the true content of
the data to a time resolution of $T = 32$ samples.

Figure 5: Signal class probabilities calculated by summing
Figure 3 over the wait states of each class.

5.2  Speech  Data 

We used the CS-MRHMM to analyze the spoken word “stool”
at 16 kHz sample rate. Space restrictions do not permit a
detailed description of the experiment. We identified seven
signal classes and assigned values of $K$ and $P$ (LPC order):

1. Noise, used for both background and the “T” closure: $K = 12$ or 384 samples, $P = 7$.

2. “S”: $K = 12$ or 384 samples, $P = 7$.

3. “T” burst: $K = 4$ or 128 samples, $P = 5$.

4. “T” aspiration: $K = 8$ or 256 samples, $P = 6$.

5. “oo” vowel part 1: $K = 24$ or 768 samples, $P = 8$.

6. “oo” vowel part 2: $K = 24$ or 768 samples, $P = 8$.

7. “L”: $K = 24$ or 768 samples, $P = 8$.

After adding slave partitions, we had a total
of 36 partitions and a total of 258 wait states. The expanded
STM was 258 by 258. Using efficient programming, neither
the partial probability matrix nor the expanded STM actually
needs to be created. Figure 4 shows the result of analysis
of one example with the CS-MRHMM. Important to note is
that the three components of the “T” can be clearly seen by
observing $\bar{\gamma}_{m,t}$.


REFERENCES 

[1] L. R. Rabiner, “A tutorial on hidden Markov models
and selected applications in speech recognition,”
Proceedings of the IEEE, vol. 77, pp. 257-286, February
1989.

[2] J. W. Picone, “Signal modeling techniques in speech
recognition,” Proceedings of the IEEE, vol. 81, no. 9,
pp. 1215-1247, 1993.

[3] P. M. Baggenstoss, “Class-specific features in
classification,” in IASTED International Conference on Signal
and Image Processing, 1998.

[4] S. Kay, “Sufficiency, classification, and the
class-specific feature theorem,” IEEE Trans. Information
Theory, vol. 46, pp. 1654-1658, July 2000.

[5] P. M. Baggenstoss, “Class-specific features in
classification,” IEEE Trans. Signal Processing, pp. 3428-3432,
December 1999.

[6] P. M. Baggenstoss, “The PDF projection theorem and
the class-specific method,” IEEE Trans. Signal
Processing, pp. 672-685, March 2003.

[7] P. M. Baggenstoss, “A modified Baum-Welch
algorithm for hidden Markov models with multiple
observation spaces,” IEEE Trans. Speech and Audio, pp. 411-416,
May 2001.

[8] P. M. Baggenstoss, “The class-specific classifier:
Avoiding the curse of dimensionality (tutorial),” IEEE
Aerospace and Electronic Systems Magazine, special
Tutorial addendum, vol. 19, pp. 37-52, January 2002.

[9] P. Baggenstoss, “Time-series segmentation,” United
States Patent 6907367, June 2005.

[10] S. M. Kay, A. H. Nuttall, and P. M. Baggenstoss,
“Multidimensional probability density function
approximation for detection, classification and model order
selection,” IEEE Trans. Signal Processing, pp. 2240-2252,
Oct 2001.


